252 research outputs found
Dynamic Data Structures for Document Collections and Graphs
In the dynamic indexing problem, we must maintain a changing collection of
text documents so that we can efficiently support insertions, deletions, and
pattern matching queries. We are especially interested in developing efficient
data structures that store and query the documents in compressed form. All
previous compressed solutions to this problem rely on answering rank and select
queries on a dynamic sequence of symbols. Because of the lower bound in
[Fredman and Saks, 1989], answering rank queries presents a bottleneck in
compressed dynamic indexing. In this paper we show how this lower bound can be
circumvented using our new framework. We demonstrate that the gap between
static and dynamic variants of the indexing problem can be almost closed. Our
method is based on a novel framework for adding dynamism to static compressed
data structures. Our framework also applies more generally to dynamizing other
problems. We show, for example, how our framework can be applied to develop
compressed representations of dynamic graphs and binary relations
A Bulk-Parallel Priority Queue in External Memory with STXXL
We propose the design and an implementation of a bulk-parallel external
memory priority queue to take advantage of both shared-memory parallelism and
high external memory transfer speeds to parallel disks. To achieve higher
performance by decoupling item insertions and extractions, we offer two
parallelization interfaces: one using "bulk" sequences, the other by defining
"limit" items. In the design, we discuss how to parallelize insertions using
multiple heaps, and how to calculate a dynamic prediction sequence to prefetch
blocks and apply parallel multiway merge for extraction. Our experimental
results show that in the selected benchmarks the priority queue reaches 75% of
the full parallel I/O bandwidth of rotational disks and and 65% of SSDs, or the
speed of sorting in external memory when bounded by computation.Comment: extended version of SEA'15 conference pape
Competitive Parallel Disk Prefetching and Buffer Management
We provide a competitive analysis framework for online prefetching and buffer management
algorithms in parallel I/O systems, using a read-once model of block references. This has
widespread applicability to key I/O-bound applications such as external merging and concurrent
playback of multiple video streams. Two realistic lookahead models, global lookahead and local
lookahead, are defined. Algorithms NOM and GREED based on these two forms of lookahead
are analyzed for shared buffer and distributed buffer configurations, both of which occur frequently
in existing systems. An important aspect of our work is that we show how to implement
both the models of lookahead in practice using the simple techniques of forecasting and flushing.
Given a
-disk parallel I/O system and a globally shared I/O buffer that can hold upto
disk
blocks, we derive a lower bound of
on the competitive ratio of any deterministic online
prefetching algorithm with
lookahead. NOM is shown to match the lower bound using
global
-block lookahead. In contrast, using only local lookahead results in an
competitive
ratio. When the buffer is distributed into
portions of
blocks each, the algorithm
GREED based on local lookahead is shown to be optimal, and NOM is within a constant factor
of optimal. Thus we provide a theoretical basis for the intuition that global lookahead is more
valuable for prefetching in the case of a shared buffer configuration whereas it is enough to provide
local lookahead in case of the distributed configuration. Finally, we analyze the performance
of these algorithms for reference strings generated by a uniformly-random stochastic process and
we show that they achieve the minimal expected number of I/Os. These results also give bounds
on the worst-case expected performance of algorithms which employ randomization in the data
layout
Hierarchical Bin Buffering: Online Local Moments for Dynamic External Memory Arrays
Local moments are used for local regression, to compute statistical measures
such as sums, averages, and standard deviations, and to approximate probability
distributions. We consider the case where the data source is a very large I/O
array of size n and we want to compute the first N local moments, for some
constant N. Without precomputation, this requires O(n) time. We develop a
sequence of algorithms of increasing sophistication that use precomputation and
additional buffer space to speed up queries. The simpler algorithms partition
the I/O array into consecutive ranges called bins, and they are applicable not
only to local-moment queries, but also to algebraic queries (MAX, AVERAGE, SUM,
etc.). With N buffers of size sqrt{n}, time complexity drops to O(sqrt n). A
more sophisticated approach uses hierarchical buffering and has a logarithmic
time complexity (O(b log_b n)), when using N hierarchical buffers of size n/b.
Using Overlapped Bin Buffering, we show that only a single buffer is needed, as
with wavelet-based algorithms, but using much less storage. Applications exist
in multidimensional and statistical databases over massive data sets,
interactive image processing, and visualization
Cylindrical Static and Kinetic Binary Space Partitions
P. K. Agarwal, L. Guibas, T. M. Murali, and J. S. Vitter. “Cylindrical Static and Kinetic Binary Space Partitions,” Computational Geometry, 16(2), 2000, 103–127. An extended abstract appears in Proceedings of the 13th Annual ACM Symposium on Computational Geometry (SCG ’97), Nice, France, June 1997, 39–48
Cylindrical Static and Kinetic Binary Space Partitions
P. K. Agarwal, L. Guibas, T. M. Murali, and J. S. Vitter. “Cylindrical Static and Kinetic Binary Space Partitions,” Computational Geometry, 16(2), 2000, 103–127. An extended abstract appears in Proceedings of the 13th Annual ACM Symposium on Computational Geometry (SCG ’97), Nice, France, June 1997, 39–48
Low Space External Memory Construction of the Succinct Permuted Longest Common Prefix Array
The longest common prefix (LCP) array is a versatile auxiliary data structure
in indexed string matching. It can be used to speed up searching using the
suffix array (SA) and provides an implicit representation of the topology of an
underlying suffix tree. The LCP array of a string of length can be
represented as an array of length words, or, in the presence of the SA, as
a bit vector of bits plus asymptotically negligible support data
structures. External memory construction algorithms for the LCP array have been
proposed, but those proposed so far have a space requirement of words
(i.e. bits) in external memory. This space requirement is in some
practical cases prohibitively expensive. We present an external memory
algorithm for constructing the bit version of the LCP array which uses
bits of additional space in external memory when given a
(compressed) BWT with alphabet size and a sampled inverse suffix array
at sampling rate . This is often a significant space gain in
practice where is usually much smaller than or even constant. We
also consider the case of computing succinct LCP arrays for circular strings
When Random Sampling Preserves Privacy
Abstract. Many organizations such as the U.S. Census publicly release samples of data that they collect about private citizens. These datasets are first anonymized using various techniques and then a small sample is released so as to enable “do-it-yourself ” calculations. This paper investigates the privacy of the second step of this process: sampling. We observe that rare values – values that occur with low frequency in the table – can be problematic from a privacy perspective. To our knowledge, this is the first work that quantitatively examines the relationship between the number of rare values in a table and the privacy in a released random sample. If we require ɛ-privacy (where the larger ɛ is, the worse the privacy guarantee) with probability at least 1 − δ, we say that 1 a value is rare if it occurs in at most Õ ( ) rows of the table (ignoring log ɛ factors). If there are no rare values, then we establish a direct connection between sample size that is safe to release and privacy. Specifically, if we select each row of the table with probability at most ɛ then the sample is O(ɛ)-private with high probability. In the case that there are t rare values, then the sample is Õ(ɛδ/t)-private with probability at least 1 − δ.
- …